IETF language tag

IETF language tags are abbreviated language codes; for example, "en" for English, "pt-BR" for Brazilian Portuguese, or "nan-Hant-TW" for Min Nan Chinese as spoken in Taiwan using traditional Han characters. They are defined by the BCP 47 standard track, which currently comprises the normative RFC 5646 (referencing the related RFC 5645) and RFC 4647, along with the normative content of the IANA Language Subtag Registry.^[1]^[2]^[3] Components of language tags are drawn from ISO 639, ISO 15924, ISO 3166-1, and UN M.49.

These language tags are used in a number of modern computing standards, including IETF standards for Internet protocols such as HTTP,^[4] W3C standards such as HTML,^[5] XML^[6] and PNG,^[7] and other standards such as SGML or Unicode (in some of its standard annexes). They are also used by national or regional standards bodies like ANSI or ECMA (for example, in some of their standards related to computing languages, or to bibliographic references and document classification used in institutional libraries).

1 History
2 Syntax of language tags
3 Relation to other standards
4 See also
5 Notes and references
6 External links

History

IETF language tags were first defined in RFC 1766, published in March 1995. The tags used ISO 639 two letter language codes, ISO 3166 two letter country codes, and allowed variant or script tags of three to eight letters.

In January 2001 this was superseded by RFC 3066, which added the use of ISO 639 part 2 three letter codes, permitted subtags with digits, and adopted the concept of language ranges from HTTP/1.1 to help with matching of language tags.

The next revision of the specification came in September 2006 with the publication of RFC 4646 (the main part of the specification) and RFC 4647 (which deals with matching behaviour). RFC 4646 introduced a more structured format for language tags, added the use of ISO 15924 four letter script codes and UN M.49 three digit geographical region codes, and replaced the old registry of tags with a new registry of subtags. The small number of previously defined tags that did not conform to the new structure were grandfathered in order to maintain compatibility with RFC 3066.

The current version of the specification, RFC 5646, was published in September 2009. The main purpose of this revision was to incorporate three letter codes from ISO 639 parts 3 and 5 into the Language Subtag Registry, in order to increase the interoperability between the ISO 639 and BCP 47 standards.^[8]

Syntax of language tags

Each language tag is composed of one or more "subtags" separated by hyphens (-). Each subtag is made with basic Latin letters or digits only.

With the exception of private-use language tags beginning with an "x-" prefix and of grandfathered language tags (including those starting with an "i-" prefix and those previously registered in the IANA database for language tags), subtags occur in the following order:

a single primary language subtag composed of a two letter language code from ISO 639-1 (2002), or a three letter code from ISO 639-2 (1998), ISO 639-3 (2007) or ISO 639-5 (2008);
up to three optional extended language subtags composed of three letters each, separated by hyphens; (There is currently no extended language subtag registered in the IANA database without an equivalent and preferred primary language subtag. This component of language tags is preserved for backwards compatibility and to allow for future parts of ISO 639.)
an optional script subtag, composed of a four letter script code from ISO 15924 (usually written in title case);
an optional region subtag composed of a two letter country code from ISO 3166-1 alpha-2 (usually written in upper case), or a three digit code from UN M.49 for geographical regions;
optional variant subtags, separated by hyphens, each composed of five to eight letters, or of four characters starting with a digit; (Variant subtags are registered with IANA and not associated with any external standard. ISO 639-6 (2009) four letter language variant codes do not meet this specification and thus are not eligible for variant subtag registration.)
optional extension subtags, separated by hyphens, each composed of a single character, with the exception of the letter x, and a hyphen followed by one or more subtags of two to eight characters each, separated by hyphens; (No extension had been registered when RFC 5646 was published; the "u" and "t" extensions were registered later, by RFC 6067 and RFC 6497 respectively.)
an optional private use subtag, composed of the letter x and a hyphen followed by subtags of one to eight characters each, separated by hyphens.
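
The subtag order listed above can be sketched as a regular expression. This is a simplified illustration in Python, not the normative grammar: the names LANGTAG and parse_language_tag are hypothetical, grandfathered tags and tags consisting only of private-use subtags are not handled, variant subtags are assumed to allow a mix of letters and digits, and subtags are checked for shape only, not against the Language Subtag Registry.

```python
import re

# Simplified pattern for the subtag order described above (an assumption-laden
# sketch: no grandfathered tags, no registry validation).
LANGTAG = re.compile(
    r"""
    ^(?P<language>[a-z]{2,3})                               # primary language
    (?P<extlang>(?:-[a-z]{3}){0,3})                         # up to three extended language subtags
    (?:-(?P<script>[a-z]{4}))?                              # script (four letters)
    (?:-(?P<region>[a-z]{2}|[0-9]{3}))?                     # region (two letters or three digits)
    (?P<variants>(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*)  # variants
    (?P<extensions>(?:-[a-wy-z0-9](?:-[a-z0-9]{2,8})+)*)    # extensions (singleton is never "x")
    (?:-x(?P<private>(?:-[a-z0-9]{1,8})+))?$                # private use
    """,
    re.IGNORECASE | re.VERBOSE,
)


def _split(text):
    """Split a hyphen-joined run of subtags into a list, dropping empty pieces."""
    return [piece for piece in (text or "").split("-") if piece]


def parse_language_tag(tag):
    """Return the components of a tag as a dict, or None if it does not fit the pattern."""
    match = LANGTAG.match(tag)
    if match is None:
        return None
    return {
        "language": match.group("language"),
        "extlang": _split(match.group("extlang")),
        "script": match.group("script"),
        "region": match.group("region"),
        "variants": _split(match.group("variants")),
        "extensions": _split(match.group("extensions")),
        "private": _split(match.group("private")),
    }
```

Under these assumptions, parse_language_tag("nan-Hant-TW") yields the language "nan", the script "Hant" and the region "TW", while a malformed string such as "en--US" is rejected.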

Subtags are not case sensitive, but the specification recommends using the same case as in the Language Subtag Registry, where region subtags are uppercase, script subtags are titlecase and all other subtags are lowercase. This capitalization follows the recommendations of the underlying ISO standards.
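
The recommended capitalization can be applied by position alone, without consulting the registry. The following Python sketch assumes the positional rule that two-letter subtags are uppercased and four-letter subtags are titlecased unless they start the tag or follow a single-character subtag; the function name format_case is hypothetical.

```python
def format_case(tag):
    """Apply the conventional capitalization to a language tag."""
    formatted = []
    after_singleton = False  # True once an extension or private-use introducer was seen
    for index, subtag in enumerate(tag.split("-")):
        if index > 0 and not after_singleton and len(subtag) == 2:
            formatted.append(subtag.upper())       # region-like subtag
        elif index > 0 and not after_singleton and len(subtag) == 4 and subtag.isalpha():
            formatted.append(subtag.capitalize())  # script-like subtag
        else:
            formatted.append(subtag.lower())       # everything else
        if len(subtag) == 1:
            after_singleton = True
    return "-".join(formatted)
```

For instance, format_case("ZH-HANT-tw") gives "zh-Hant-TW". Because tags are not case sensitive, this changes only the presentation, never the meaning.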

Optional subtags should be omitted when they add no distinguishing information to a language tag. For example, "es" is preferred over "es-Latn", as Spanish is fully expected to be written with Latin script; "arb" is preferred over "arb-Arab", as Modern Standard Arabic is understood to be written with Arabic script.

Region subtags are often deprecated by the registration of specific primary language subtags from ISO 639-3 which are now "preferred values". For example, "ar-DZ" is deprecated with the preferred value "arq" for Algerian Spoken Arabic; "arq-DZ" is also deprecated, as the country code adds no further distinction. Most regional differences in languages are interpreted as differences of dialect, rather than being purely regional.

Not all linguistic regions can be represented with a valid region subtag: the subnational regional dialects of a primary language are currently registered specifically as variant subtags. For example, the "valencia" variant subtag for the Valencian dialect of Catalan is registered in the IANA database with the restricting language tag prefix "ca". The region subtag "ES" is implicit for this dialect spoken in two autonomous regions of Spain.

IETF language tags have been used as locale identifiers in many applications. It is recommended that other means be used for defining, encoding and matching locales as this is out of scope of the IETF BCP 47 standard track.

The use, interpretation and matching of IETF language tags are currently defined in RFC 4647 in combination with RFC 5646 and RFC 5645. The Language Subtag Registry, maintained by IANA, lists all currently valid public subtags. Private use subtags are not registered in the IANA database, as they are implementation-dependent and subject to private agreements between the third parties using them. These private agreements are out of scope of the BCP 47 standard track.
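
One of the matching schemes in RFC 4647, called lookup, progressively truncates the requested language range from the end until it equals one of the available tags. The Python sketch below illustrates that idea under simplifying assumptions: comparison is a plain case-insensitive string match, wildcards and prioritized lists of ranges are ignored, and the names lookup, available and default are hypothetical.

```python
def lookup(language_range, available, default=None):
    """Return the best available tag for a single language range, or the default."""
    by_key = {tag.lower(): tag for tag in available}
    subtags = language_range.lower().split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate in by_key:
            return by_key[candidate]
        subtags.pop()  # drop the last subtag and try again
        # A single-character subtag (such as "x") is removed together with
        # the subtag that followed it, so it never ends the truncated range.
        if subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return default
```

With the available tags "zh-Hans", "zh-Hant" and "en", a request for "zh-Hant-TW" falls back to "zh-Hant"; a request for "ja" matches nothing and returns the default.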

Relation to other standards

Although subtags are often derived from ISO standards, they do not follow these standards absolutely, as this could lead to the meaning of language tags changing over time.

In particular, a subtag derived from a code assigned by ISO 639, ISO 15924, ISO 3166 (or UN M.49 only for supranational geographical regions) remains a valid (though deprecated) subtag even if the code is withdrawn from the corresponding ISO standard. If the ISO standard later assigns a new meaning to the withdrawn code, the corresponding subtag will still retain its old meaning.

This stability was introduced in the (now obsolete) RFC 4646 (and confirmed in its successor). Before RFC 4646, changes in the meaning of ISO codes could cause changes in the meaning of language tags.

Relations to ISO 639-3 (individual languages and macro-languages) and some parts of ISO 639-1

The obsoleted RFC 4646 (as well as its current successor RFC 5646), unlike its predecessors, defined the concept of an "extended language subtag", although it still did not permit the registration of such subtags.^[9]^,^[10]

However, with the newer RFC 5645 and RFC 5646, all individual languages and macro-languages of ISO 639-3 were finally registered as (primary) language subtags. A new language matching algorithm allows a resource that has no localization in an individual language to be looked for in its macro-language, whose code is now present in the IANA database along with other classification information from ISO 639-3 (and from ISO 639-5 for language families).

The new version of the specification still allows certain codes to be registered as extended language subtags. Most of them are for individual languages that are members of a macro-language: the subtag of the macro-language is used as the "prefix", and the ISO 639-3 code of the individual language is used as an extended language subtag, valid only with this prefix. A "preferred" value is also defined in this case, which replaces the combination of these subtags with the single subtag for the individual language. In effect, the language-extlang tag is now an alias of the shorter (primary) language tag for all individual languages defined in ISO 639-3. The same holds for the even shorter alpha-2 codes inherited from ISO 639-1: where such a code exists for an individual language, the alpha-3 codes of ISO 639-2 and ISO 639-3 are aliases of it.

Additionally, some legacy language tags did not follow this pattern (for example, they used an IANA-specific prefix like "i-" followed by a specific subtag) but referred to individual languages that have since been encoded in ISO 639-3. These tags have been "grandfathered", meaning that they remain valid but are no longer recommended:

For example, the language tag "zh-min-nan" for the Minnan language is defined as a "grandfathered" language tag whose preferred value is the newer, shorter (primary) language tag "nan".
Another example is the Hakka language: the tag "i-hak", previously defined with the IANA prefix, is now a grandfathered language tag whose preferred value is just "hak" (from ISO 639-3) and whose display name is now "Hakka Chinese". A new alias from "zh-hak" to "hak" was also added using the standard (prefix-based) rules for valid extended language subtags.
Similar "grandfathered" combinations of a primary language subtag with region subtags were also aliased to a preferred (primary) language tag, notably for the various (national) sign languages that were also encoded as individual languages in ISO 639-3 and added to the IANA registry.
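
The two kinds of aliasing described in this section, grandfathered tags and language-extlang pairs, can be sketched as table lookups. The Python example below is only an illustration: the tables are small hand-copied excerpts covering the Minnan and Hakka examples above rather than the full IANA registry, and the names GRANDFATHERED, EXTLANG_PREFIX and to_preferred are hypothetical.

```python
# Hand-copied excerpts for illustration; a real implementation would load
# the full IANA Language Subtag Registry instead.
GRANDFATHERED = {"zh-min-nan": "nan", "i-hak": "hak"}  # whole tag -> preferred value
EXTLANG_PREFIX = {"hak": "zh", "nan": "zh"}            # extlang -> required prefix


def to_preferred(tag):
    """Replace a grandfathered tag or a language-extlang pair with its preferred form."""
    whole = GRANDFATHERED.get(tag.lower())
    if whole is not None:
        return whole
    subtags = tag.split("-")
    if len(subtags) >= 2 and EXTLANG_PREFIX.get(subtags[1].lower()) == subtags[0].lower():
        # Drop the macro-language prefix; the extlang becomes the primary subtag.
        return "-".join(subtags[1:])
    return tag
```

Here to_preferred("i-hak") and to_preferred("zh-hak") both give "hak", and tags that appear in neither table are returned unchanged.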

Relations to ISO 639-5 (language collections) and some parts of ISO 639-2

ISO 639-5 defines language families with alpha-3 codes in a different way from the collection codes initially encoded in ISO 639-2 (including one code already present in ISO 639-1). Notably, the language collections are now all defined in ISO 639-5 as inclusive rather than exclusive. This means that language collections have a broader scope than before, and in some cases may encompass languages that were already encoded separately within ISO 639-2.

To avoid breaking implementations that may still depend on the older (exclusive) definition of these collections, ISO 639-5 adds a grouping type attribute to all collections that were already encoded in ISO 639-2 (no grouping type is defined for the new collections added only in ISO 639-5).

However, this property has not yet been added to the IANA database for language subtags, which instead defines "Language"-type entries for these primary language subtags with a "Scope" property simply equal to "collection", making no distinction in cases where the inclusive or exclusive definition could be significant. It is left to implementations to choose how they match collections of individual languages or macro-languages. As a consequence, BCP 47 language tags that include primary language subtags of collections cannot safely be interpreted using the newer inclusive definition of language families introduced in ISO 639-3 and ISO 639-5.

The newer alpha-3 codes for these collections were also defined as aliases of a preferred value using an existing alpha-2 code, where one was already encoded. For now this concerns only one alpha-3 code in ISO 639-2, which was already encoded as a primary language subtag but aliased to a preferred value already encoded in ISO 639-1 with a shorter alpha-2 code. It may happen for other subtags in the future, if a primary language subtag currently defined with the "Scope" of a macro-language is redefined with the "Scope" of a collection that is later added to ISO 639-5 and deprecated from ISO 639-3.

It should be noted that the language collections currently defined in ISO 639-5 are very broad and are not as precise as the classifications found in linguistic resources like The Ethnologue (published by SIL International) or, more precisely still, The Linguist List (LinguistList.org). A comprehensive and more precise classification of language families is still a work in progress, as is the classification of languages or macro-languages within these collections, so none of this is standardized yet.

Neither the ISO 639 standard nor BCP 47 currently defines precisely which of the languages or macro-languages encoded in ISO 639-3 (and parts of ISO 639-2 or ISO 639-1) are members of these collections: only the hierarchical classification of the collections already encoded in ISO 639-5 is encoded, using the newer inclusive definition of these collections (and not the legacy exclusive definition inherited from ISO 639-2 or ISO 639-1).

This means that using the codes of language collections for language identification within BCP 47 is still risky and cannot be recommended (except for individual languages that are currently not identified or still unencoded) until a further revision of the related standards is published. This may take several years, or may never happen if linguists do not resolve their disagreements about some languages: notably languages that some linguists perceive as creoles or pidgins and others as individual languages belonging principally to one family, or languages that have progressively evolved from one family towards another from which they have borrowed many terms and constructions.

By contrast, the classification of individual languages within their macro-language is now standardized (in ISO 639-3, as well as in the IANA database for BCP 47 language tags).

Relations to ISO 15924 (script subtags), ISO/IEC 10646 and Unicode

Script subtags were added in RFC 4646, from the list of codes defined in ISO 15924, and were therefore already in the IANA database when the revision RFC 5646 was published. They are maintained, and are placed after the language subtags but before any region subtag and before variant subtags for dialects or orthographic variants of the same individual language written with the same script.

In addition, some primary language subtags are now defined with a property named "Suppressed Script" which indicates the cases where a single script can be safely assumed by default for the language, even if it can be written with another script. In this case, the combined language-script language tag is effectively an alias (declared "redundant" in the IANA database) to just the language subtag. A different script subtag can still be appended to make the distinction, when necessary. For example "yi-Hebr" is aliased to the preferred value "yi", because the generic Hebrew script code is assumed for the Yiddish language.
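
The effect of the "Suppressed Script" property can be sketched as follows. This Python example is an illustration only: the table holds just the two language and script pairs mentioned in this article rather than the registry's full data, tags with extended language subtags are not handled, and the names SUPPRESS_SCRIPT and strip_default_script are hypothetical.

```python
# Two illustrative entries (language -> script assumed by default); a real
# implementation would read these from the IANA Language Subtag Registry.
SUPPRESS_SCRIPT = {"es": "Latn", "yi": "Hebr"}


def strip_default_script(tag):
    """Remove the script subtag when it is the one assumed by default for the language."""
    subtags = tag.split("-")
    # Assumes the script, if present, directly follows the primary language subtag.
    if len(subtags) >= 2 and len(subtags[1]) == 4 and subtags[1].isalpha():
        default = SUPPRESS_SCRIPT.get(subtags[0].lower())
        if default is not None and subtags[1].lower() == default.lower():
            del subtags[1]
    return "-".join(subtags)
```

Under these assumptions "yi-Hebr" is reduced to "yi", while "yi-Latn" keeps its script subtag because the script differs from the default and is therefore significant.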

Some more complex aliasing patterns have also been studied. For example "zh-Hans-SG" is now aliased (as "redundant") to the preferred value "zh-Hans", because the region code is not significant, and the written form used in Singapore uses the same simplified version of the sinograms as used in other countries where Chinese is written. However the script variant code is maintained because it is significant.

Note that ISO 15924 includes codes for several script variants (for example "Hans" and "Hant" for simplified and traditional forms of ideographs) that are unified within Unicode and ISO/IEC 10646. These script variants are most often encoded for bibliographic purposes, but are not always significant from a linguistic point of view (for example the "Latf" and "Latg" script codes for the Fraktur and Gaelic variants of the Latin script, which are unified within "Latn" in Unicode and ISO/IEC 10646). Some of these script subtags could be aliased to a preferred value within BCP 47 if necessary; for now they remain part of the IANA registry without any alias, because they may expose orthographic and possibly semantic differences, with different analyses of letters, diacritics, and digraphs/trigraphs as default grapheme clusters, or differences in letter casing rules.

Past issues with ISO 3166-1 and UN M.49 (region subtags)

Further information: The CS controversy

If a new ISO 3166-1 alpha-2 code would conflict with an existing region subtag (due to the code having previously had a different meaning), a UN M.49 code can be used instead. This rule was introduced in the now obsoleted RFC 4646, but maintained in its current successor RFC 5646. UN M.49 is also the source for region subtags such as 005 for South America, as ISO 3166 does not provide codes for supranational regions.

In addition, most languages that needed a region subtag in RFC 4646 (and previous versions) no longer need it with their new preferred value (a single primary language subtag from ISO 639-3). The preferred mechanism is now not to use region subtags, but rather one of the following: to register and use "variant" subtags for regional variants of a language that are perceived as dialects of the same language; to request the encoding of new codes in ISO 639-3 for individual languages (possibly transforming an existing primary language that encompassed these variants into a macro-language); or to use a script subtag when the difference is just the script and the same language is written the same way in multiple countries using the same script variant.

Some past language tags that used a region subtag which can be assumed by default for the language (from a linguistic point of view, where just the language needs to be identified and not other localization data) are also declared in the IANA registry as "redundant" tags. For example "de-DE" is redundant and aliased to the preferred value "de" (using just the preferred primary language subtag).

Further work is expected on the comprehensive encoding of language variants and dialects in a new part of ISO 639 (and then in a later revision of the RFCs in the BCP 47 standard track).

Notes and references

External links

BCP 47 Language Tags – current specification (contains two RFCs, RFC 5646 and RFC 4647 published separately at different dates, but concatenated in a single document)
- (also referencing the related informational RFC 5645, which complements the previous informational RFC 4645, as well as other individual registration forms published separately by others for each language added or modified in the registry between these BCP 47 revisions)
Language Subtag Registry – maintained by IANA
Language tags in HTML and XML – from the W3C
http://www.langtag.net/
IANA Language Subtag Registry Search – an unofficial tool for users to find subtags and view entries in the registry